{
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.9-final"
  },
  "orig_nbformat": 2,
  "kernelspec": {
   "name": "python3",
   "display_name": "Python 3",
   "language": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2,
 "cells": [
  {
   "source": [
    "<center><h1>第八章 文本数据</h1></center>"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "source": [
    "## 一、str对象\n",
    "### 1. str对象的设计意图\n",
    "\n",
    "`str`对象是定义在`Index`或`Series`上的属性，专门用于处理每个元素的文本内容，其内部定义了大量方法，因此对一个序列进行文本处理，首先需要获取其`str`对象。在Python标准库中也有`str`模块，为了使用上的便利，有许多函数的用法`pandas`照搬了它的设计，例如字母转为大写的操作："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "'ABCD'"
      ]
     },
     "metadata": {},
     "execution_count": 2
    }
   ],
   "source": [
    "var = 'abcd'\n",
    "str.upper(var) # Python内置str模块"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "<pandas.core.strings.accessor.StringMethods at 0x1488ea6db08>"
      ]
     },
     "metadata": {},
     "execution_count": 3
    }
   ],
   "source": [
    "s = pd.Series(['abcd', 'efg', 'hi'])\n",
    "s.str"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    ABCD\n",
       "1     EFG\n",
       "2      HI\n",
       "dtype: object"
      ]
     },
     "metadata": {},
     "execution_count": 4
    }
   ],
   "source": [
    "s.str.upper() # pandas中str对象上的upper方法"
   ]
  },
  {
   "source": [
    "根据文档`API`材料，在`pandas`的50个`str`对象方法中，有31个是和标准库中的`str`模块方法同名且功能一致，这为批量处理序列提供了有力的工具。\n",
    "\n",
    "### 2. []索引器\n",
    "\n",
    "对于`str`对象而言，可理解为其对字符串进行了序列化的操作，例如在一般的字符串中，通过`[]`可以取出某个位置的元素："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "'a'"
      ]
     },
     "metadata": {},
     "execution_count": 5
    }
   ],
   "source": [
    "var[0]"
   ]
  },
  {
   "source": [
    "同时也能通过切片得到子串："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "'db'"
      ]
     },
     "metadata": {},
     "execution_count": 6
    }
   ],
   "source": [
    "var[-1: 0: -2]"
   ]
  },
  {
   "source": [
    "通过对`str`对象使用`[]`索引器，可以完成完全一致的功能，并且如果超出范围则返回缺失值："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    a\n",
       "1    e\n",
       "2    h\n",
       "dtype: object"
      ]
     },
     "metadata": {},
     "execution_count": 7
    }
   ],
   "source": [
    "s.str[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    db\n",
       "1     g\n",
       "2     i\n",
       "dtype: object"
      ]
     },
     "metadata": {},
     "execution_count": 8
    }
   ],
   "source": [
    "s.str[-1: 0: -2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0      c\n",
       "1      g\n",
       "2    NaN\n",
       "dtype: object"
      ]
     },
     "metadata": {},
     "execution_count": 9
    }
   ],
   "source": [
    "s.str[2]"
   ]
  },
  {
   "source": [
    "### 3. string类型\n",
    "\n",
    "在上一章提到，从`pandas`的`1.0.0`版本开始，引入了`string`类型，其引入的动机在于：原来所有的字符串类型都会以`object`类型的`Series`进行存储，但`object`类型只应当存储混合类型，例如同时存储浮点、字符串、字典、列表、自定义类型等，因此字符串有必要同数值型或`category`一样，具有自己的数据存储类型，从而引入了`string`类型。\n",
    "\n",
    "总体上说，绝大多数对于`object`和`string`类型的序列使用`str`对象方法产生的结果是一致，但是在下面提到的两点上有较大差异：\n",
    "\n",
    "首先，应当尽量保证每一个序列中的值都是字符串的情况下才使用`str`属性，但这并不是必须的，其必要条件是序列中至少有一个可迭代（Iterable）对象，包括但不限于字符串、字典、列表。对于一个可迭代对象，`string`类型的`str`对象和`object`类型的`str`对象返回结果可能是不同的。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    temp_1\n",
       "1         b\n",
       "2       NaN\n",
       "3         y\n",
       "dtype: object"
      ]
     },
     "metadata": {},
     "execution_count": 10
    }
   ],
   "source": [
    "s = pd.Series([{1: 'temp_1', 2: 'temp_2'}, ['a', 'b'], 0.5, 'my_string'])\n",
    "s.str[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    1\n",
       "1    '\n",
       "2    .\n",
       "3    y\n",
       "dtype: string"
      ]
     },
     "metadata": {},
     "execution_count": 11
    }
   ],
   "source": [
    "s.astype('string').str[1]"
   ]
  },
  {
   "source": [
    "除了最后一个字符串元素，前三个元素返回的值都不同，其原因在于当序列类型为`object`时，是对于每一个元素进行`[]`索引，因此对于字典而言，返回temp_1字符串，对于列表则返回第二个值，而第三个为不可迭代对象，返回缺失值，第四个是对字符串进行`[]`索引。而`string`类型的`str`对象先把整个元素转为字面意义的字符串，例如对于列表而言，第一个元素即 \"{\"，而对于最后一个字符串元素而言，恰好转化前后的表示方法一致，因此结果和`object`类型一致。\n",
    "\n",
    "除了对于某些对象的`str`序列化方法不同之外，两者另外的一个差别在于，`string`类型是`Nullable`类型，但`object`不是。这意味着`string`类型的序列，如果调用的`str`方法返回值为整数`Series`和布尔`Series`时，其分别对应的`dtype`是`Int`和`boolean`的`Nullable`类型，而`object`类型则会分别返回`int/float`和`bool/object`，取决于缺失值的存在与否。同时，字符串的比较操作，也具有相似的特性，`string`返回`Nullable`类型，但`object`不会。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    1\n",
       "dtype: int64"
      ]
     },
     "metadata": {},
     "execution_count": 12
    }
   ],
   "source": [
    "s = pd.Series(['a'])\n",
    "s.str.len()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    1\n",
       "dtype: Int64"
      ]
     },
     "metadata": {},
     "execution_count": 13
    }
   ],
   "source": [
    "s.astype('string').str.len()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    True\n",
       "dtype: bool"
      ]
     },
     "metadata": {},
     "execution_count": 14
    }
   ],
   "source": [
    "s == 'a'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    True\n",
       "dtype: boolean"
      ]
     },
     "metadata": {},
     "execution_count": 15
    }
   ],
   "source": [
    "s.astype('string') == 'a'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "s = pd.Series(['a', np.nan]) # 带有缺失值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    1.0\n",
       "1    NaN\n",
       "dtype: float64"
      ]
     },
     "metadata": {},
     "execution_count": 17
    }
   ],
   "source": [
    "s.str.len()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0       1\n",
       "1    <NA>\n",
       "dtype: Int64"
      ]
     },
     "metadata": {},
     "execution_count": 18
    }
   ],
   "source": [
    "s.astype('string').str.len()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0     True\n",
       "1    False\n",
       "dtype: bool"
      ]
     },
     "metadata": {},
     "execution_count": 19
    }
   ],
   "source": [
    "s == 'a'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    True\n",
       "1    <NA>\n",
       "dtype: boolean"
      ]
     },
     "metadata": {},
     "execution_count": 20
    }
   ],
   "source": [
    "s.astype('string') == 'a'"
   ]
  },
  {
   "source": [
    "最后需要注意的是，对于全体元素为数值类型的序列，即使其类型为`object`或者`category`也不允许直接使用`str`属性。如果需要把数字当成`string`类型处理，可以使用`astype`强制转换为`string`类型的`Series`："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    2\n",
       "1    4\n",
       "2    7\n",
       "dtype: string"
      ]
     },
     "metadata": {},
     "execution_count": 21
    }
   ],
   "source": [
    "s = pd.Series([12, 345, 6789])\n",
    "s.astype('string').str[1]"
   ]
  },
  {
   "source": [
    "## 二、正则表达式基础\n",
    "\n",
    "这一节的两个表格来自于[learn-regex-zh](https://github.com/cdoco/learn-regex-zh)这个关于正则表达式项目，其使用`MIT`开源许可协议。这里只是介绍正则表达式的基本用法，需要系统学习的读者可参考[正则表达式必知必会](https://book.douban.com/subject/26285406/)一书。\n",
    "\n",
    "### 1. 一般字符的匹配\n",
    "\n",
    "正则表达式是一种按照某种正则模式，从左到右匹配字符串中内容的一种工具。对于一般的字符而言，它可以找到其所在的位置，这里为了演示便利，使用了`python`中`re`模块的`findall`函数来匹配所有出现过但不重叠的模式，第一个参数是正则表达式，第二个参数是待匹配的字符串。例如，在下面的字符串中找出`apple`："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "['Apple', 'Apple']"
      ]
     },
     "metadata": {},
     "execution_count": 22
    }
   ],
   "source": [
    "import re\n",
    "re.findall(r'Apple', 'Apple! This Is an Apple!')"
   ]
  },
  {
   "source": [
    "### 2. 元字符基础\n",
    "|元字符 |   描述 |\n",
    "| :-----| ----: |\n",
    "|.       |    匹配除换行符以外的任意字符|\n",
    "|\\[ \\]     |      字符类，匹配方括号中包含的任意字符|\n",
    "|\\[^ \\]     |      否定字符类，匹配方括号中不包含的任意字符|\n",
    "|\\*       |    匹配前面的子表达式零次或多次|\n",
    "|\\+       |    匹配前面的子表达式一次或多次|\n",
    "|?        |   匹配前面的子表达式零次或一次|\n",
    "|{n,m}    |       花括号，匹配前面字符至少 n 次，但是不超过 m 次|\n",
    "|(xyz)   |        字符组，按照确切的顺序匹配字符xyz|\n",
    "|\\|     |      分支结构，匹配符号之前的字符或后面的字符|\n",
    "|\\\\    |       转义符，它可以还原元字符原来的含义|\n",
    "|^    |       匹配行的开始|\n",
    "|$   |        匹配行的结束|"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "['a', 'b', 'c']"
      ]
     },
     "metadata": {},
     "execution_count": 23
    }
   ],
   "source": [
    "re.findall(r'.', 'abc')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "['a', 'c']"
      ]
     },
     "metadata": {},
     "execution_count": 24
    }
   ],
   "source": [
    "re.findall(r'[ac]', 'abc')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "['b']"
      ]
     },
     "metadata": {},
     "execution_count": 25
    }
   ],
   "source": [
    "re.findall(r'[^ac]', 'abc')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "['aa', 'aa', 'bb', 'bb']"
      ]
     },
     "metadata": {},
     "execution_count": 26
    }
   ],
   "source": [
    "re.findall(r'[ab]{2}', 'aaaabbbb') # {n}指匹配n次"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "['aaa', 'bbb']"
      ]
     },
     "metadata": {},
     "execution_count": 27
    }
   ],
   "source": [
    "re.findall(r'aaa|bbb', 'aaaabbbb')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "['a', 'a', 'a', 'a']"
      ]
     },
     "metadata": {},
     "execution_count": 28
    }
   ],
   "source": [
    "re.findall(r'a\\\\?|a\\*', 'aa?a*a')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "['ab', 'aa', 'c', 'ad', 'aa', 'e']"
      ]
     },
     "metadata": {},
     "execution_count": 29
    }
   ],
   "source": [
    "re.findall(r'a?.', 'abaacadaae')"
   ]
  },
  {
   "source": [
    "### 3. 简写字符集\n",
    "此外，正则表达式中还有一类简写字符集，其等价于一组字符的集合：\n",
    "\n",
    "|简写    |  描述 |\n",
    "| :-----| :---- |\n",
    "|\\\\w     |   匹配所有字母、数字、下划线: \\[a-zA-Z0-9\\_\\] |\n",
    "|\\\\W     |   匹配非字母和数字的字符: \\[^\\\\w\\]|\n",
    "|\\\\d     |   匹配数字: \\[0-9\\]|\n",
    "|\\\\D   |     匹配非数字: \\[^\\\\d\\]|\n",
    "|\\\\s    |    匹配空格符: \\[\\\\t\\\\n\\\\f\\\\r\\\\p{Z}\\]|\n",
    "|\\\\S    |    匹配非空格符: \\[^\\\\s\\]|\n",
    "|\\\\B  |      匹配一组非空字符开头或结尾的位置，不代表具体字符|"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "['is', 'Is']"
      ]
     },
     "metadata": {},
     "execution_count": 30
    }
   ],
   "source": [
    "re.findall(r'.s', 'Apple! This Is an Apple!')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "['09', '7w', 'c_', '9q']"
      ]
     },
     "metadata": {},
     "execution_count": 31
    }
   ],
   "source": [
    "re.findall(r'\\w{2}', '09 8? 7w c_ 9q p@')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "['8?', 'p@']"
      ]
     },
     "metadata": {},
     "execution_count": 32
    }
   ],
   "source": [
    "re.findall(r'\\w\\W\\B', '09 8? 7w c_ 9q p@')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "['t d', 'g w', 's t', 'e s']"
      ]
     },
     "metadata": {},
     "execution_count": 33
    }
   ],
   "source": [
    "re.findall(r'.\\s.', 'Constant dropping wears the stone.')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "[('黄浦区', '方浜中路', '249号'), ('宝山区', '密山路', '5号')]"
      ]
     },
     "metadata": {},
     "execution_count": 34
    }
   ],
   "source": [
    "re.findall(r'上海市(.{2,3}区)(.{2,3}路)(\\d+号)', '上海市黄浦区方浜中路249号 上海市宝山区密山路5号')"
   ]
  },
  {
   "source": [
    "## 三、文本处理的五类操作\n",
    "### 1. 拆分\n",
    "\n",
    "`str.split`能够把字符串的列进行拆分，其中第一个参数为正则表达式，可选参数包括从左到右的最大拆分次数`n`，是否展开为多个列`expand`。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    [上海, 黄浦, 方浜中, 249号]\n",
       "1       [上海, 宝山, 密山, 5号]\n",
       "dtype: object"
      ]
     },
     "metadata": {},
     "execution_count": 35
    }
   ],
   "source": [
    "s = pd.Series(['上海市黄浦区方浜中路249号', '上海市宝山区密山路5号'])\n",
    "s.str.split('[市区路]')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "    0   1         2\n",
       "0  上海  黄浦  方浜中路249号\n",
       "1  上海  宝山     密山路5号"
      ],
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>0</th>\n      <th>1</th>\n      <th>2</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>上海</td>\n      <td>黄浦</td>\n      <td>方浜中路249号</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>上海</td>\n      <td>宝山</td>\n      <td>密山路5号</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 36
    }
   ],
   "source": [
    "s.str.split('[市区路]', n=2, expand=True)"
   ]
  },
  {
   "source": [
    "与其类似的函数是`str.rsplit`，其区别在于使用`n`参数的时候是从右到左限制最大拆分次数。但是当前版本下`rsplit`因为`bug`而无法使用正则表达式进行分割："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "                0\n",
       "0  上海市黄浦区方浜中路249号\n",
       "1     上海市宝山区密山路5号"
      ],
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>0</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>上海市黄浦区方浜中路249号</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>上海市宝山区密山路5号</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 37
    }
   ],
   "source": [
    "s.str.rsplit('[市区路]', n=2, expand=True)"
   ]
  },
  {
   "source": [
    "### 2. 合并\n",
    "\n",
    "关于合并一共有两个函数，分别是`str.join`和`str.cat`。`str.join`表示用某个连接符把`Series`中的字符串列表连接起来，如果列表中出现了非字符串元素则返回缺失值："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    a-b\n",
       "1    NaN\n",
       "2    NaN\n",
       "dtype: object"
      ]
     },
     "metadata": {},
     "execution_count": 38
    }
   ],
   "source": [
    "s = pd.Series([['a','b'], [1, 'a'], [['a', 'b'], 'c']])\n",
    "s.str.join('-')"
   ]
  },
  {
   "source": [
    "`str.cat`用于合并两个序列，主要参数为连接符`sep`、连接形式`join`以及缺失值替代符号`na_rep`，其中连接形式默认为以索引为键的左连接。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    a-cat\n",
       "1    b-dog\n",
       "dtype: object"
      ]
     },
     "metadata": {},
     "execution_count": 39
    }
   ],
   "source": [
    "s1 = pd.Series(['a','b'])\n",
    "s2 = pd.Series(['cat','dog'])\n",
    "s1.str.cat(s2,sep='-')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0      a-?\n",
       "1    b-cat\n",
       "2    ?-dog\n",
       "dtype: object"
      ]
     },
     "metadata": {},
     "execution_count": 40
    }
   ],
   "source": [
    "s2.index = [1, 2]\n",
    "s1.str.cat(s2, sep='-', na_rep='?', join='outer')"
   ]
  },
  {
   "source": [
    "### 3. 匹配\n",
    "\n",
    "`str.contains`返回了每个字符串是否包含正则模式的布尔序列："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0     True\n",
       "1     True\n",
       "2    False\n",
       "dtype: bool"
      ]
     },
     "metadata": {},
     "execution_count": 41
    }
   ],
   "source": [
    "s = pd.Series(['my cat', 'he is fat', 'railway station'])\n",
    "s.str.contains('\\s\\wat')"
   ]
  },
  {
   "source": [
    "`str.startswith`和`str.endswith`返回了每个字符串以给定模式为开始和结束的布尔序列，它们都不支持正则表达式："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0     True\n",
       "1    False\n",
       "2    False\n",
       "dtype: bool"
      ]
     },
     "metadata": {},
     "execution_count": 42
    }
   ],
   "source": [
    "s.str.startswith('my')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0     True\n",
       "1     True\n",
       "2    False\n",
       "dtype: bool"
      ]
     },
     "metadata": {},
     "execution_count": 43
    }
   ],
   "source": [
    "s.str.endswith('t')"
   ]
  },
  {
   "source": [
    "如果需要用正则表达式来检测开始或结束字符串的模式，可以使用`str.match`，其返回了每个字符串起始处是否符合给定正则模式的布尔序列："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0     True\n",
       "1     True\n",
       "2    False\n",
       "dtype: bool"
      ]
     },
     "metadata": {},
     "execution_count": 44
    }
   ],
   "source": [
    "s.str.match('m|h')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    False\n",
       "1     True\n",
       "2     True\n",
       "dtype: bool"
      ]
     },
     "metadata": {},
     "execution_count": 45
    }
   ],
   "source": [
    "s.str[::-1].str.match('ta[f|g]|n') # 反转后匹配"
   ]
  },
  {
   "source": [
    "当然，这些也能通过在`str.contains`的正则中使用`^`和`$`来实现："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0     True\n",
       "1     True\n",
       "2    False\n",
       "dtype: bool"
      ]
     },
     "metadata": {},
     "execution_count": 46
    }
   ],
   "source": [
    "s.str.contains('^[m|h]')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    False\n",
       "1     True\n",
       "2     True\n",
       "dtype: bool"
      ]
     },
     "metadata": {},
     "execution_count": 47
    }
   ],
   "source": [
    "s.str.contains('[f|g]at|n$')"
   ]
  },
  {
   "source": [
    "除了上述返回值为布尔的匹配之外，还有一种返回索引的匹配函数，即`str.find`与`str.rfind`，其分别返回从左到右和从右到左第一次匹配的位置的索引，未找到则返回-1。需要注意的是这两个函数不支持正则匹配，只能用于字符子串的匹配："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    11\n",
       "dtype: int64"
      ]
     },
     "metadata": {},
     "execution_count": 48
    }
   ],
   "source": [
    "s = pd.Series(['This is an apple. That is not an apple.'])\n",
    "s.str.find('apple')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    33\n",
       "dtype: int64"
      ]
     },
     "metadata": {},
     "execution_count": 49
    }
   ],
   "source": [
    "s.str.rfind('apple')"
   ]
  },
  {
   "source": [
    "### 4. 替换\n",
    "\n",
    "`str.replace`和`replace`并不是一个函数，在使用字符串替换时应当使用前者。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    a_new_b\n",
       "1      c_new\n",
       "dtype: object"
      ]
     },
     "metadata": {},
     "execution_count": 50
    }
   ],
   "source": [
    "s = pd.Series(['a_1_b','c_?'])\n",
    "s.str.replace('\\d|\\?', 'new', regex=True)"
   ]
  },
  {
   "source": [
    "当需要对不同部分进行有差别的替换时，可以利用`子组`的方法，并且此时可以通过传入自定义的替换函数来分别进行处理，注意`group(k)`代表匹配到的第`k`个子组（圆括号之间的内容）："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    Shanghai HP District Mid Fangbin Road No. 249\n",
       "1           Shanghai BS District Mishan Road No. 5\n",
       "2           Beijing CP District Beinong Road No. 2\n",
       "dtype: object"
      ]
     },
     "metadata": {},
     "execution_count": 51
    }
   ],
   "source": [
    "s = pd.Series(['上海市黄浦区方浜中路249号',\n",
    "                '上海市宝山区密山路5号',\n",
    "                '北京市昌平区北农路2号'])\n",
    "pat = '(\\w+市)(\\w+区)(\\w+路)(\\d+号)'\n",
    "city = {'上海市': 'Shanghai', '北京市': 'Beijing'}\n",
    "district = {'昌平区': 'CP District',\n",
    "            '黄浦区': 'HP District',\n",
    "            '宝山区': 'BS District'}\n",
    "road = {'方浜中路': 'Mid Fangbin Road',\n",
    "        '密山路': 'Mishan Road',\n",
    "        '北农路': 'Beinong Road'}\n",
    "def my_func(m):\n",
    "    str_city = city[m.group(1)]\n",
    "    str_district = district[m.group(2)]\n",
    "    str_road = road[m.group(3)]\n",
    "    str_no = 'No. ' + m.group(4)[:-1]\n",
    "    return ' '.join([str_city,\n",
    "                     str_district,\n",
    "                     str_road,\n",
    "                     str_no])\n",
    "s.str.replace(pat, my_func, regex=True)"
   ]
  },
  {
   "source": [
    "这里的数字标识并不直观，可以使用`命名子组`更加清晰地写出子组代表的含义："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    Shanghai HP District Mid Fangbin Road No. 249\n",
       "1           Shanghai BS District Mishan Road No. 5\n",
       "2           Beijing CP District Beinong Road No. 2\n",
       "dtype: object"
      ]
     },
     "metadata": {},
     "execution_count": 52
    }
   ],
   "source": [
    "pat = '(?P<市名>\\w+市)(?P<区名>\\w+区)(?P<路名>\\w+路)(?P<编号>\\d+号)'\n",
    "def my_func(m):\n",
    "    str_city = city[m.group('市名')]\n",
    "    str_district = district[m.group('区名')]\n",
    "    str_road = road[m.group('路名')]\n",
    "    str_no = 'No. ' + m.group('编号')[:-1]\n",
    "    return ' '.join([str_city,\n",
    "                     str_district,\n",
    "                     str_road,\n",
    "                     str_no])\n",
    "s.str.replace(pat, my_func, regex=True)"
   ]
  },
  {
   "source": [
    "这里虽然看起来有些繁杂，但是实际数据处理中对应的替换，一般都会通过代码来获取数据从而构造字典映射，在具体写法上会简洁的多。\n",
    "\n",
    "### 5. 提取\n",
    "\n",
    "提取既可以认为是一种返回具体元素（而不是布尔值或元素对应的索引位置）的匹配操作，也可以认为是一种特殊的拆分操作。前面提到的`str.split`例子中会把分隔符去除，这并不是用户想要的效果，这时候就可以用`str.extract`进行提取："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "     0    1     2     3\n",
       "0  上海市  黄浦区  方浜中路  249号\n",
       "1  上海市  宝山区   密山路    5号\n",
       "2  北京市  昌平区   北农路    2号"
      ],
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>0</th>\n      <th>1</th>\n      <th>2</th>\n      <th>3</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>上海市</td>\n      <td>黄浦区</td>\n      <td>方浜中路</td>\n      <td>249号</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>上海市</td>\n      <td>宝山区</td>\n      <td>密山路</td>\n      <td>5号</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>北京市</td>\n      <td>昌平区</td>\n      <td>北农路</td>\n      <td>2号</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 53
    }
   ],
   "source": [
    "pat = '(\\w+市)(\\w+区)(\\w+路)(\\d+号)'\n",
    "s.str.extract(pat)"
   ]
  },
  {
   "source": [
    "通过子组的命名，可以直接对新生成`DataFrame`的列命名："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "    市名   区名    路名    编号\n",
       "0  上海市  黄浦区  方浜中路  249号\n",
       "1  上海市  宝山区   密山路    5号\n",
       "2  北京市  昌平区   北农路    2号"
      ],
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>市名</th>\n      <th>区名</th>\n      <th>路名</th>\n      <th>编号</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>上海市</td>\n      <td>黄浦区</td>\n      <td>方浜中路</td>\n      <td>249号</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>上海市</td>\n      <td>宝山区</td>\n      <td>密山路</td>\n      <td>5号</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>北京市</td>\n      <td>昌平区</td>\n      <td>北农路</td>\n      <td>2号</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 54
    }
   ],
   "source": [
    "pat = '(?P<市名>\\w+市)(?P<区名>\\w+区)(?P<路名>\\w+路)(?P<编号>\\d+号)'\n",
    "s.str.extract(pat)"
   ]
  },
  {
   "source": [
    "`str.extractall`不同于`str.extract`只匹配一次，它会把所有符合条件的模式全部匹配出来，如果存在多个结果，则以多级索引的方式存储："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "              0   1\n",
       "     match         \n",
       "my_A 0      135  15\n",
       "     1       26   5\n",
       "my_B 0      674   2\n",
       "     1       25   6"
      ],
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th></th>\n      <th>0</th>\n      <th>1</th>\n    </tr>\n    <tr>\n      <th></th>\n      <th>match</th>\n      <th></th>\n      <th></th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th rowspan=\"2\" valign=\"top\">my_A</th>\n      <th>0</th>\n      <td>135</td>\n      <td>15</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>26</td>\n      <td>5</td>\n    </tr>\n    <tr>\n      <th rowspan=\"2\" valign=\"top\">my_B</th>\n      <th>0</th>\n      <td>674</td>\n      <td>2</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>25</td>\n      <td>6</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 55
    }
   ],
   "source": [
    "s = pd.Series(['A135T15,A26S5','B674S2,B25T6'], index = ['my_A','my_B'])\n",
    "pat = '[A|B](\\d+)[T|S](\\d+)'\n",
    "s.str.extractall(pat)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "           name1 name2\n",
       "     match            \n",
       "my_A 0       135    15\n",
       "     1        26     5\n",
       "my_B 0       674     2\n",
       "     1        25     6"
      ],
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th></th>\n      <th>name1</th>\n      <th>name2</th>\n    </tr>\n    <tr>\n      <th></th>\n      <th>match</th>\n      <th></th>\n      <th></th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th rowspan=\"2\" valign=\"top\">my_A</th>\n      <th>0</th>\n      <td>135</td>\n      <td>15</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>26</td>\n      <td>5</td>\n    </tr>\n    <tr>\n      <th rowspan=\"2\" valign=\"top\">my_B</th>\n      <th>0</th>\n      <td>674</td>\n      <td>2</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>25</td>\n      <td>6</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 56
    }
   ],
   "source": [
    "pat_with_name = '[A|B](?P<name1>\\d+)[T|S](?P<name2>\\d+)'\n",
    "s.str.extractall(pat_with_name)"
   ]
  },
  {
   "source": [
    "`str.findall`的功能类似于`str.extractall`，区别在于前者把结果存入列表中，而后者处理为多级索引，每个行只对应一组匹配，而不是把所有匹配组合构成列表。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "my_A    [(135, 15), (26, 5)]\n",
       "my_B     [(674, 2), (25, 6)]\n",
       "dtype: object"
      ]
     },
     "metadata": {},
     "execution_count": 57
    }
   ],
   "source": [
    "s.str.findall(pat)"
   ]
  },
  {
   "source": [
    "## 四、常用字符串函数\n",
    "\n",
    "除了上述介绍的五类字符串操作有关的函数之外，`str`对象上还定义了一些实用的其他方法，在此进行介绍：\n",
    "\n",
    "### 1. 字母型函数\n",
    "\n",
    "`upper, lower, title, capitalize, swapcase`这五个函数主要用于字母的大小写转化，从下面的例子中就容易领会其功能："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0                 LOWER\n",
       "1              CAPITALS\n",
       "2    THIS IS A SENTENCE\n",
       "3              SWAPCASE\n",
       "dtype: object"
      ]
     },
     "metadata": {},
     "execution_count": 58
    }
   ],
   "source": [
    "s = pd.Series(['lower', 'CAPITALS', 'this is a sentence', 'SwApCaSe'])\n",
    "s.str.upper()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0                 lower\n",
       "1              capitals\n",
       "2    this is a sentence\n",
       "3              swapcase\n",
       "dtype: object"
      ]
     },
     "metadata": {},
     "execution_count": 59
    }
   ],
   "source": [
    "s.str.lower()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0                 Lower\n",
       "1              Capitals\n",
       "2    This Is A Sentence\n",
       "3              Swapcase\n",
       "dtype: object"
      ]
     },
     "metadata": {},
     "execution_count": 60
    }
   ],
   "source": [
    "s.str.title()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0                 Lower\n",
       "1              Capitals\n",
       "2    This is a sentence\n",
       "3              Swapcase\n",
       "dtype: object"
      ]
     },
     "metadata": {},
     "execution_count": 61
    }
   ],
   "source": [
    "s.str.capitalize()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0                 LOWER\n",
       "1              capitals\n",
       "2    THIS IS A SENTENCE\n",
       "3              sWaPcAsE\n",
       "dtype: object"
      ]
     },
     "metadata": {},
     "execution_count": 62
    }
   ],
   "source": [
    "s.str.swapcase()"
   ]
  },
  {
   "source": [
    "### 2. 数值型函数\n",
    "\n",
    "这里着重需要介绍的是`pd.to_numeric`方法，它虽然不是`str`对象上的方法，但是能够对字符格式的数值进行快速转换和筛选。其主要参数包括`errors`和`downcast`分别代表了非数值的处理模式和转换类型。其中，对于不能转换为数值的有三种`errors`选项，`raise, coerce, ignore`分别表示直接报错、设为缺失以及保持原来的字符串。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0       1\n",
       "1     2.2\n",
       "2      2e\n",
       "3      ??\n",
       "4    -2.1\n",
       "5       0\n",
       "dtype: object"
      ]
     },
     "metadata": {},
     "execution_count": 63
    }
   ],
   "source": [
    "s = pd.Series(['1', '2.2', '2e', '??', '-2.1', '0'])\n",
    "pd.to_numeric(s, errors='ignore')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    1.0\n",
       "1    2.2\n",
       "2    NaN\n",
       "3    NaN\n",
       "4   -2.1\n",
       "5    0.0\n",
       "dtype: float64"
      ]
     },
     "metadata": {},
     "execution_count": 64
    }
   ],
   "source": [
    "pd.to_numeric(s, errors='coerce')"
   ]
  },
  {
   "source": [
    "在数据清洗时，可以利用`coerce`的设定，快速查看非数值型的行："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "2    2e\n",
       "3    ??\n",
       "dtype: object"
      ]
     },
     "metadata": {},
     "execution_count": 65
    }
   ],
   "source": [
    "s[pd.to_numeric(s, errors='coerce').isna()]"
   ]
  },
  {
   "source": [
    "### 3. 统计型函数\n",
    "\n",
    "`count`和`len`的作用分别是返回出现正则模式的次数和字符串的长度："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    2\n",
       "1    2\n",
       "dtype: int64"
      ]
     },
     "metadata": {},
     "execution_count": 66
    }
   ],
   "source": [
    "s = pd.Series(['cat rat fat at', 'get feed sheet heat'])\n",
    "s.str.count('[r|f]at|ee')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    14\n",
       "1    19\n",
       "dtype: int64"
      ]
     },
     "metadata": {},
     "execution_count": 67
    }
   ],
   "source": [
    "s.str.len()"
   ]
  },
  {
   "source": [
    "### 4. 格式型函数\n",
    "格式型函数主要分为两类，第一种是除空型，第二种是填充型。其中，第一类函数一共有三种，它们分别是`strip, rstrip, lstrip`，分别代表去除两侧空格、右侧空格和左侧空格。这些函数在数据清洗时是有用的，特别是列名含有非法空格的时候。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "Int64Index([4, 4, 4], dtype='int64')"
      ]
     },
     "metadata": {},
     "execution_count": 68
    }
   ],
   "source": [
    "my_index = pd.Index([' col1', 'col2 ', ' col3 '])\n",
    "my_index.str.strip().str.len()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "Int64Index([5, 4, 5], dtype='int64')"
      ]
     },
     "metadata": {},
     "execution_count": 69
    }
   ],
   "source": [
    "my_index.str.rstrip().str.len()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "Int64Index([4, 5, 5], dtype='int64')"
      ]
     },
     "metadata": {},
     "execution_count": 70
    }
   ],
   "source": [
    "my_index.str.lstrip().str.len()"
   ]
  },
  {
   "source": [
    "对于填充型函数而言，`pad`是最灵活的，它可以选定字符串长度、填充的方向和填充内容："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    ****a\n",
       "1    ****b\n",
       "2    ****c\n",
       "dtype: object"
      ]
     },
     "metadata": {},
     "execution_count": 71
    }
   ],
   "source": [
    "s = pd.Series(['a','b','c'])\n",
    "s.str.pad(5,'left','*')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    a****\n",
       "1    b****\n",
       "2    c****\n",
       "dtype: object"
      ]
     },
     "metadata": {},
     "execution_count": 72
    }
   ],
   "source": [
    "s.str.pad(5,'right','*')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    **a**\n",
       "1    **b**\n",
       "2    **c**\n",
       "dtype: object"
      ]
     },
     "metadata": {},
     "execution_count": 73
    }
   ],
   "source": [
    "s.str.pad(5,'both','*')"
   ]
  },
  {
   "source": [
    "上述的三种情况可以分别用`rjust, ljust, center`来等效完成，需要注意`ljust`是指右侧填充而不是左侧填充："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    ****a\n",
       "1    ****b\n",
       "2    ****c\n",
       "dtype: object"
      ]
     },
     "metadata": {},
     "execution_count": 74
    }
   ],
   "source": [
    "s.str.rjust(5, '*')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    a****\n",
       "1    b****\n",
       "2    c****\n",
       "dtype: object"
      ]
     },
     "metadata": {},
     "execution_count": 75
    }
   ],
   "source": [
    "s.str.ljust(5, '*')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    **a**\n",
       "1    **b**\n",
       "2    **c**\n",
       "dtype: object"
      ]
     },
     "metadata": {},
     "execution_count": 76
    }
   ],
   "source": [
    "s.str.center(5, '*')"
   ]
  },
  {
   "source": [
    "在读取`excel`文件时，经常会出现数字前补0的需求，例如证券代码读入的时候会把\"000007\"作为数值7来处理，`pandas`中除了可以使用上面的左侧填充函数进行操作之外，还可用`zfill`来实现。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    000007\n",
       "1    000155\n",
       "2    303000\n",
       "dtype: string"
      ]
     },
     "metadata": {},
     "execution_count": 77
    }
   ],
   "source": [
    "s = pd.Series([7, 155, 303000]).astype('string')\n",
    "s.str.pad(6,'left','0')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    000007\n",
       "1    000155\n",
       "2    303000\n",
       "dtype: string"
      ]
     },
     "metadata": {},
     "execution_count": 78
    }
   ],
   "source": [
    "s.str.rjust(6,'0')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "0    000007\n",
       "1    000155\n",
       "2    303000\n",
       "dtype: string"
      ]
     },
     "metadata": {},
     "execution_count": 79
    }
   ],
   "source": [
    "s.str.zfill(6)"
   ]
  },
  {
   "source": [
    "## 五、练习\n",
    "### Ex1：房屋信息数据集\n",
    "现有一份房屋信息数据集如下："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "      floor    year    area price\n",
       "0   高层（共6层）  1986年建  58.23㎡  155万\n",
       "1  中层（共20层）  2020年建     88㎡  155万\n",
       "2  低层（共28层）  2010年建  89.33㎡  365万"
      ],
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>floor</th>\n      <th>year</th>\n      <th>area</th>\n      <th>price</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>高层（共6层）</td>\n      <td>1986年建</td>\n      <td>58.23㎡</td>\n      <td>155万</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>中层（共20层）</td>\n      <td>2020年建</td>\n      <td>88㎡</td>\n      <td>155万</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>低层（共28层）</td>\n      <td>2010年建</td>\n      <td>89.33㎡</td>\n      <td>365万</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 80
    }
   ],
   "source": [
    "df = pd.read_excel('../data/house_info.xls', usecols=['floor','year','area','price'])\n",
    "df.head(3)"
   ]
  },
  {
   "source": [
    "1. 将`year`列改为整数年份存储。\n",
    "2. 将`floor`列替换为`Level, Highest`两列，其中的元素分别为`string`类型的层类别（高层、中层、低层）与整数类型的最高层数。\n",
    "3. 计算房屋每平米的均价`avg_price`，以`***元/平米`的格式存储到表中，其中`***`为整数。\n",
    "### Ex2：《权力的游戏》剧本数据集\n",
    "现有一份权力的游戏剧本数据集如下："
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "  Release Date    Season   Episode      Episode Title          Name  \\\n",
       "0   2011-04-17  Season 1  Episode 1  Winter is Coming  waymar royce   \n",
       "1   2011-04-17  Season 1  Episode 1  Winter is Coming          will   \n",
       "2   2011-04-17  Season 1  Episode 1  Winter is Coming  waymar royce   \n",
       "\n",
       "                                            Sentence  \n",
       "0  What do you expect? They're savages. One lot s...  \n",
       "1  I've never seen wildlings do a thing like this...  \n",
       "2                             How close did you get?  "
      ],
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>Release Date</th>\n      <th>Season</th>\n      <th>Episode</th>\n      <th>Episode Title</th>\n      <th>Name</th>\n      <th>Sentence</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>2011-04-17</td>\n      <td>Season 1</td>\n      <td>Episode 1</td>\n      <td>Winter is Coming</td>\n      <td>waymar royce</td>\n      <td>What do you expect? They're savages. One lot s...</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>2011-04-17</td>\n      <td>Season 1</td>\n      <td>Episode 1</td>\n      <td>Winter is Coming</td>\n      <td>will</td>\n      <td>I've never seen wildlings do a thing like this...</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>2011-04-17</td>\n      <td>Season 1</td>\n      <td>Episode 1</td>\n      <td>Winter is Coming</td>\n      <td>waymar royce</td>\n      <td>How close did you get?</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 81
    }
   ],
   "source": [
    "df = pd.read_csv('../data/script.csv')\n",
    "df.head(3)"
   ]
  },
  {
   "source": [
    "1. 计算每一个`Episode`的台词条数。\n",
    "2. 以空格为单词的分割符号，请求出单句台词平均单词量最多的前五个人。\n",
    "3. 若某人的台词中含有问号，那么下一个说台词的人即为回答者。若上一人台词中含有$n$个问号，则认为回答者回答了$n$个问题，请求出回答最多问题的前五个人。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  }
 ]
}